Abstract

We investigate the order of the $r$-th, $1 \le r < +\infty$, central moment
of the length of the longest common subsequence of two independent
random words of size $n$ whose letters are identically distributed and
independently drawn from a finite alphabet. When all but one of the
letters are drawn with small probabilities, which depend on the size of
the alphabet, a lower bound is shown to be of order $n^{r/2}$. This result
complements a generic upper bound of order $n^{r/2}$.
1 Introduction and results

Let $X = (X_i)_{i \ge 1}$ and $Y = (Y_i)_{i \ge 1}$ be two independent sequences of iid random
variables taking values in a finite alphabet $\mathcal{A}_m = \{\alpha_1, \alpha_2, \dots, \alpha_m\}$, with
$P(X_1 = \alpha_k) = P(Y_1 = \alpha_k) = p_k$, $k = 1, 2, \dots, m$. Let $LC_n$ be the length
of the longest common subsequence of $X_1 \cdots X_n$ and $Y_1 \cdots Y_n$, i.e., $LC_n$
is the largest $k$ such that there exist $1 \le i_1 < i_2 < \cdots < i_k \le n$ and
$1 \le j_1 < j_2 < \cdots < j_k \le n$, such that $X_{i_1} = Y_{j_1}, X_{i_2} = Y_{j_2}, \dots, X_{i_k} = Y_{j_k}$.
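To make the definition of $LC_n$ concrete, here is a minimal sketch of the classical dynamic program that computes the length of a longest common subsequence of two finite words. The function name `lcs_length` is ours and purely illustrative; it is not taken from the paper.

```python
def lcs_length(x, y):
    """Length of a longest common subsequence of the sequences x and y."""
    # prev[j] holds the LCS length of the already processed prefix of x
    # and the prefix y[:j].
    prev = [0] * (len(y) + 1)
    for a in x:
        cur = [0]
        for j, b in enumerate(y, start=1):
            if a == b:
                # Extend a common subsequence by the matching pair (a, b).
                cur.append(prev[j - 1] + 1)
            else:
                # Otherwise drop the last letter of one of the two prefixes.
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


if __name__ == "__main__":
    print(lcs_length("ABCBDAB", "BDCABA"))
```

The routine runs in time proportional to the product of the two word lengths and keeps only two rows of the table in memory.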

The study of the asymptotic behavior of $LC_n$ has a long history starting
with the well known result of Chvátal and Sankoff [5] which asserts that

$\lim_{n \to \infty} \frac{E LC_n}{n} = \gamma^*$. (1.1)

However, the exact value of $\gamma^*$, which depends on the distribution of $X_1$
and on the size of the alphabet, is still unknown even in "simple cases" such
as for uniform Bernoulli random variables. This first asymptotic result was
sharpened by Alexander in [T] and [2] where the speed of convergence to $\gamma^*$
in (1.1) is investigated, and where it is shown that

$\gamma^* n - C\sqrt{n \log n} \le E LC_n \le \gamma^* n$, (1.2)

where $C > 0$ is a constant depending neither on $n$ nor on the distribution
of $X_1$. Next, as far as the order of the variance is concerned, Steele [H]
first proved that $\mathrm{Var}\, LC_n \le n$, but finding the order of the lower bound
is more elusive. In various instances, where there is some "bias" such as
for an asymmetric scoring function or highly asymmetric Bernoulli random 
variables, the lower bound is shown to be of order n ([1], [6], [7]). This is 
also the case if the sequences are iid uniform and contain sparse long blocks, 
a situation which, in some sense, is as close as we want to the iid uniform 
one (see [3]). In contrast to [6] or which deal only with binary words, 
our results are proved for alphabets of arbitrary fixed size and are thus novel 
in that context even for the order of the variance. Moreover, these results 
are no longer asymptotic and furthermore provide precise constants sharply 
depending on the alphabet size. 

We investigate next the r-th central moment of LCn, when all but one of 
the letters are drawn with small probabilities, and prove: 

Theorem 1.1 Let $1 \le r < +\infty$, and let $(X_j)_{j \ge 1}$ and $(Y_j)_{j \ge 1}$ be two independent
sequences of iid random variables with values in $\mathcal{A}_m = \{\alpha_1, \alpha_2, \dots, \alpha_m\}$,
and with $P(X_1 = \alpha_k) = p_k$, $k = 1, 2, \dots, m$. Let $p_{j_0} > 1/2$, for some
jo G {I,-- - ,m} and let maxjjijgPj < K/m, where K = 2~^3e~'''^. Then 
there exists a constant $C > 0$ depending on $r$, $m$, $p_{j_0}$ and $\max_{j \ne j_0} p_j$, but
independent of $n \in \mathbb{N}$, such that

$M_r(LC_n) := E|LC_n - E LC_n|^r \ge C n^{r/2}$. (1.3)

An estimate on the constant $C$ above is provided in Remark 2.1.

The variance upper bound obtained by Steele ([H]) relies on an asymmetric
version of the Efron-Stein inequality which can be viewed as a tensorization
property of the variance. The symmetric Efron-Stein inequality has seen a
generalization, due to Rhee and Talagrand [8], to the $r$-th moment, where it
is, in turn, viewed as a consequence of Burkholder's square function inequality.
In the asymmetric case, a similar extension is also valid as explained
next. First, let $S : \mathbb{R}^n \to \mathbb{R}$ be a Borel function and let also $(Z_j)_{1 \le j \le n}$ and
$(\hat{Z}_j)_{1 \le j \le n}$ be two independent families of iid random variables having the
same law. Now, and with suboptimal notation, let

$S = S(Z_1, Z_2, \dots, Z_n)$,

and let

$S_i = S(Z_1, Z_2, \dots, Z_{i-1}, \hat{Z}_i, Z_{i+1}, \dots, Z_n)$, $1 \le i \le n$.

Then, for any $r \ge 2$,

$\|S - ES\|_r := (E|S - ES|^r)^{1/r} \le \frac{r-1}{2^{1/r}} \left( \sum_{i=1}^{n} \|S - S_i\|_r^2 \right)^{1/2}$. (1.4)

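Indeed, for $i = 1, \dots, n$, let $\mathcal{F}_i := \sigma(Z_1, \dots, Z_i)$, let $\mathcal{F}_0$ be trivial, and let
$d_i = E(S|\mathcal{F}_i) - E(S|\mathcal{F}_{i-1})$. Thus, $(d_i, \mathcal{F}_i)_{1 \le i \le n}$ forms a sequence of martingale
differences, and from Burkholder's square function inequality, with optimal
constant, for $r \ge 2$,

$\|S - ES\|_r = \left\| \sum_{i=1}^{n} d_i \right\|_r \le (r-1) \left\| \left( \sum_{i=1}^{n} d_i^2 \right)^{1/2} \right\|_r \le (r-1) \left( \sum_{i=1}^{n} \|d_i\|_r^2 \right)^{1/2}$. (1.5)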



Moreover, and as in [8], for each $1 \le i \le n$, letting $\mathcal{G}_i = \sigma(Z_1, Z_2, \dots, Z_i, \hat{Z}_i)$,

$E|S - S_i|^r = E(E(|S - S_i|^r \,|\, \mathcal{G}_i))$
$\ge E|E(S|\mathcal{G}_i) - E(S|\mathcal{F}_{i-1}) + E(S_i|\mathcal{F}_{i-1}) - E(S_i|\mathcal{G}_i)|^r$
$:= E|U + V|^r$, (1.6)

where $U = E(S|\mathcal{G}_i) - E(S|\mathcal{F}_{i-1})$ and $V = E(S_i|\mathcal{F}_{i-1}) - E(S_i|\mathcal{G}_i)$. But, given $\mathcal{F}_{i-1}$,
$U$ and $V$ are independent, with moreover $E(U|\mathcal{F}_{i-1}) = E(V|\mathcal{F}_{i-1}) = 0$
and $E|U|^r = E|V|^r = E|d_i|^r$. Thus

$E|U + V|^r = E(E(|U + V|^r \,|\, \mathcal{F}_{i-1})) \ge E|U|^r + E|V|^r = 2E|d_i|^r$. (1.7)

Combining (1.5), (1.6) and (1.7) gives (1.4).

Let us now apply (1.4) to $LC_n$ viewed as a function of $X_1, \dots, X_n, Y_1, \dots, Y_n$.
First, note that replacing $X_i$ (resp. $Y_i$) by an independent copy $\hat{X}_i$ (resp. $\hat{Y}_i$) of itself
implies that $|LC_n - LC_n(X_1, \dots, \hat{X}_i, \dots, Y_n)|$ (resp. $|LC_n - LC_n(X_1, \dots, \hat{Y}_i, \dots, Y_n)|$)
is always at most 1. Thus, for any $r \ge 2$,

$M_r(LC_n) \le \left( \frac{r-1}{2^{1/r}} \right)^r (2n)^{r/2}$, (1.8)

which further yields,

$M_r(LC_n) \le n^{r/2}$,

for any $0 < r \le 2$.
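As a sanity check of the generic upper bound, the following sketch computes $M_r(LC_n)$ exactly, by brute-force enumeration of all pairs of words, for very small $n$, and compares it with $n^{r/2}$. The helper names (`lcs_length`, `central_moment`) and the chosen letter probabilities are illustrative assumptions, not part of the paper.

```python
from itertools import product


def lcs_length(x, y):
    """Length of a longest common subsequence of the sequences x and y."""
    prev = [0] * (len(y) + 1)
    for a in x:
        cur = [0]
        for j, b in enumerate(y, start=1):
            cur.append(prev[j - 1] + 1 if a == b else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def word_probability(word, probs):
    """Probability of an iid word whose letters have distribution probs."""
    result = 1.0
    for letter in word:
        result *= probs[letter]
    return result


def central_moment(n, probs, r):
    """Exact r-th central moment of LC_n for two independent iid words."""
    words = list(product(range(len(probs)), repeat=n))
    law = {}  # law[v] = P(LC_n = v)
    for x in words:
        px = word_probability(x, probs)
        for y in words:
            value = lcs_length(x, y)
            law[value] = law.get(value, 0.0) + px * word_probability(y, probs)
    mean = sum(v * p for v, p in law.items())
    return sum(abs(v - mean) ** r * p for v, p in law.items())


if __name__ == "__main__":
    for n in (1, 2, 3, 4):
        for r in (1.0, 1.5, 2.0):
            print(n, r, central_moment(n, (0.5, 0.5), r), n ** (r / 2))
```

The enumeration has cost exponential in $n$, so this is only meant for illustration on tiny words.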

As far as the content of the rest of the paper is concerned, in Section 2 we give
a proof of Theorem 1.1 which relies on a key preliminary result, Theorem
2.1, whose proof is given in Section 3.



2 Proof of Theorem 1.1 



Throughout the paper, by finite sequences $X$ and $Y$ of length $n$, it is meant
that $X = (X_i)_{1 \le i \le n}$ and $Y = (Y_i)_{1 \le i \le n}$.

The strategy of proof to obtain the lower bound is to first represent $LC_n$
as a random function of the number of $\alpha_{j_0}$ in the sequences. This random
function satisfies locally a reversed Lipschitz condition, as $n$ goes to infinity,
and this ultimately gives the lower bound in Theorem 1.1. To start, and as
in [6], pick a letter equiprobably at random from all the non-$\alpha_{j_0}$ letters in
either one of the two finite sequences, of length $n$, $X$ and $Y$. Next, change
it to the most probable letter $\alpha_{j_0}$ and call the two new finite sequences $\tilde{X}$
and $\tilde{Y}$. Then the length of the longest common subsequence of $\tilde{X}$ and $\tilde{Y}$,
denoted by $\widetilde{LC}_n$, tends, on an event of high probability, to be larger than
$LC_n$. This fact is obtained via the following theorem.
Theorem 2.1 Let the hypothesis of Theorem 1.1 hold. Then, there exists a
set $B_n \subset \mathcal{A}_m^n \times \mathcal{A}_m^n$, such that for all $n \ge 1$,

P ((X, F) e S„) > 1 - 121 exp (^_ <^^^3^3.Vjf ^^ ^ (2.1) 

and such that for all $(x, y) \in B_n$,

$P(\widetilde{LC}_n - LC_n = 1 \,|\, X = x, Y = y) \ge K_2$, (2.2)

$P(\widetilde{LC}_n - LC_n = -1 \,|\, X = x, Y = y) \le K_2/2$, (2.3)
where K2 = Ki/m, with Ki = 2^'^3e-^'^ . 

The proof of Theorem 2.1 is given in the next section, but we indicate
next how it leads to the lower bound on $M_r(LC_n)$ given in Theorem 1.1.
From now on, assume without loss of generality that $p_1 > 1/2$ and that
$p_2 = \max_{2 \le j \le m} p_j$.

We start with a few definitions. For the two finite random sequences
$X = (X_i)_{1 \le i \le n}$ and $Y = (Y_i)_{1 \le i \le n}$, let $N_1$ be the total number of letters $\alpha_1$
present in both of them. Next, by induction, define a finite collection of pairs
of random sequences $(X^k, Y^k)_{0 \le k \le 2n}$ as follows: First, let $X^0 = (X_i^0)_{1 \le i \le n}$
and $Y^0 = (Y_i^0)_{1 \le i \le n}$ be independent, with $X_i^0$ and $Y_i^0$, $i = 1, \dots, n$, iid
random variables with values in $\{\alpha_2, \dots, \alpha_m\}$ and such that $P(X_i^0 = \alpha_k) =
P(Y_i^0 = \alpha_k) = p_k/(1 - p_1)$, $2 \le k \le m$. In other words, $X^0$ and $Y^0$ are two
independent finite sequences of iid random variables such that $(X^0, Y^0)$ has the
law of $(X, Y)$ conditional on $N_1 = 0$. Once $(X^k, Y^k)$ is defined, let $(X^{k+1}, Y^{k+1})$ be the pair
of finite random sequences obtained by picking, with equal probability, one
letter from all the letters $\alpha_2, \alpha_3, \dots, \alpha_m$ in the pair $(X^k, Y^k)$ and replacing
it with $\alpha_1$. Clearly, for $1 \le k \le 2n - 1$, $X^k$ and $Y^k$ are not independent.
Then, denote by $LC_n(k)$ the length of the longest common subsequence of
$X^k$ and $Y^k$. Our first lemma shows that the law of $(X^k, Y^k)$ is the same as
the law of $(X, Y)$ conditional on $N_1 = k$, and therefore the law of $LC_n(k)$ is
the same as the conditional law of $LC_n$ given $N_1 = k$.

Lemma 2.1 Let $X = (X_i)_{1 \le i \le n}$ and $Y = (Y_i)_{1 \le i \le n}$. Then, for
$k = 0, 1, \dots, 2n$,

$(X^k, Y^k) \overset{d}{=} (X, Y \,|\, N_1 = k)$, (2.4)

and therefore,

$(X^{N_1}, Y^{N_1}) \overset{d}{=} (X, Y)$,

where $\overset{d}{=}$ denotes equality in law.
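The first claim of the lemma can be checked exactly on a tiny instance. The sketch below is an illustration only: it treats the pair $(X, Y)$ as a single word of length $2n$, encodes $\alpha_1$ as the integer 0, and uses exact rational arithmetic. The function names `conditional_law` and `replace_one_letter` are hypothetical. It iterates the replacement step and compares the resulting law with the law of an iid word conditioned on containing exactly $k$ letters $\alpha_1$.

```python
from fractions import Fraction
from itertools import product


def conditional_law(probs, length, k):
    """Law of an iid word of the given length, conditioned on letter 0 appearing exactly k times."""
    weights = {}
    for word in product(range(len(probs)), repeat=length):
        if word.count(0) == k:
            weight = Fraction(1)
            for letter in word:
                weight *= probs[letter]
            weights[word] = weight
    total = sum(weights.values())
    return {word: weight / total for word, weight in weights.items()}


def replace_one_letter(law):
    """Pick uniformly one letter different from 0 and replace it by 0."""
    new_law = {}
    for word, prob in law.items():
        positions = [i for i, letter in enumerate(word) if letter != 0]
        for i in positions:
            changed = word[:i] + (0,) + word[i + 1:]
            new_law[changed] = new_law.get(changed, Fraction(0)) + prob / len(positions)
    return new_law


def check_lemma(probs, length):
    """Return True if the k-th iterate matches the conditional law for every k."""
    law = conditional_law(probs, length, 0)
    for k in range(length + 1):
        if law != conditional_law(probs, length, k):
            return False
        if k < length:
            law = replace_one_letter(law)
    return True


if __name__ == "__main__":
    probs = (Fraction(1, 2), Fraction(1, 3), Fraction(1, 6))
    print(check_lemma(probs, 4))  # words of total length 2n with n = 2
```

Because the arithmetic is exact, equality of the two dictionaries is a genuine equality of laws on this small example, not a numerical approximation.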

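Proof. The proof is by induction on $k$. By definition, $(X^0, Y^0)$ has the same
law as $(X, Y)$ conditional on $N_1 = 0$. For any $(\alpha_{j_1}, \dots, \alpha_{j_{2n}}) \in \mathcal{A}_m^n \times \mathcal{A}_m^n$, let

$q_\ell = |\{1 \le i \le 2n : \alpha_{j_i} = \alpha_\ell\}|$,

$1 \le \ell \le m$. Now assume (2.4) is true for $k$, i.e., assume that for any
$(\alpha_{j_1}, \dots, \alpha_{j_{2n}}) \in \mathcal{A}_m^n \times \mathcal{A}_m^n$ with $q_1 = k$,

$P((X_1^k, \dots, X_n^k, Y_1^k, \dots, Y_n^k) = (\alpha_{j_1}, \dots, \alpha_{j_{2n}})) = \binom{2n}{k}^{-1} \prod_{\ell=2}^{m} \left( \frac{p_\ell}{1 - p_1} \right)^{q_\ell}$. (2.5)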

Then, for any (a^j, ■ ■ ■ , ajj^) G x A^, with gi = + 1, 

P ((X^\ ■ ■ ■ , ■ ■ ■ , = (a,,, ■ ■ ■ , a,,J) = 

fc+i 

J] P ((Xf+\ ■ ■ ■ , ■ ■ ■ , l^'^+i) = (a,,, ■ ■ ■ , a,,J +1) P(i?^i), 

(2.6) 

where B^^^, 1 < i < k + 1, is the event that the i-th ai in (a^^, ■ ■ ■ , aj2„) is 
changed from a non-ai letter when passing from {X'',Y^) to {X^^^,Y''~^^). 
Conditional on -Bf"*"^, the i-th ai in (a^j, ■ ■ ■ , a^-j^) could have been changed 
from any letter in {a2, as, ■ ■ ■ , am}? and assuming this ai has been changed 
from Q!s, 2 < s < m, the corresponding probability is given by: 

p ((A-, y-) ^ („,, . ^ Q " n (t^)" (r^) . 

where of course, above, takes the place of the i-th ai in the sequence 
,«j2n)- Thus, 

P ((Xf+\ ■ ■ ■ , ■ ■ ■ , y„^+i) = («,„ ■ ■ ■ , a,,J|Ef+^) P(S^i) 

fn\~^ f Pe \''' f p \ 1 
which when incorporated into (2.6), gives

p ■ ■ ■ , ■ ■ ■ , y::^^) = ■ ■ ■ , 

finishing the proof of the first part of the lemma. 

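Next, from the above, and since $N_1$ and $\{(X^k, Y^k)\}_{0 \le k \le 2n}$ are independent,
for any $(t, s) \in \mathbb{R}^n \times \mathbb{R}^n$,

$E\left(e^{i\langle t, X^{N_1}\rangle + i\langle s, Y^{N_1}\rangle}\right) = \sum_{k=0}^{2n} E\left(e^{i\langle t, X^{k}\rangle + i\langle s, Y^{k}\rangle} 1_{\{N_1 = k\}}\right)$
$= \sum_{k=0}^{2n} E\left(e^{i\langle t, X^{k}\rangle + i\langle s, Y^{k}\rangle}\right) P(N_1 = k)$
$= \sum_{k=0}^{2n} E\left(e^{i\langle t, X\rangle + i\langle s, Y\rangle} \,|\, N_1 = k\right) P(N_1 = k)$
$= E\left(e^{i\langle t, X\rangle + i\langle s, Y\rangle}\right)$. (2.7)

■

Thus $(X^{N_1}, Y^{N_1})$ has the same law as $(X, Y)$, and so $LC_n(N_1)$, the length
of the longest common subsequence of $(X^{N_1}, Y^{N_1})$, has the same law as $LC_n$,
and therefore,

$M_r(LC_n(N_1)) = M_r(LC_n)$. (2.8)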

To prove Theorem 1.1 we also need the following simple inequality valid
for functions satisfying locally a reversed Lipschitz condition.

Lemma 2.2 Let $f : D \to \mathbb{Z}$ satisfy locally a reversed Lipschitz condition,
i.e., let $f$ be such that for any $i, j \in D$ with $j \ge i + \ell$,

$f(j) - f(i) \ge c(j - i)$,

where $c$ and $\ell$ are two positive constants, and let $T$ be a $D$-valued random
variable such that $E|f(T)| < +\infty$, then for any $r \ge 1$,

$M_r(f(T)) \ge \left( \frac{c}{2} \right)^r (M_r(T) - \ell^r)$. (2.9)

Proof. Let $\hat{T}$ be an independent copy of $T$. Then, for any $r \ge 1$,

$M_r(T) \le E(|T - \hat{T}|^r) \le 2^r M_r(T)$, (2.10)

which in turn implies,

$M_r(f(T)) \ge \frac{1}{2^r} E(|f(T) - f(\hat{T})|^r)$
$\ge \left( \frac{c}{2} \right)^r \left( E((T - \hat{T})^r 1_{\{T - \hat{T} \ge \ell\}}) + E((\hat{T} - T)^r 1_{\{\hat{T} - T \ge \ell\}}) \right)$
$\ge \left( \frac{c}{2} \right)^r (M_r(T) - \ell^r)$.

Now, for any random variable $U$ and random vector $V$ with finite $r$-th
moment, $r \ge 1$, let

$M_r(U|V) := E(|U - E(U|V)|^r \,|\, V)$.

Then,

$M_r(U) \ge \frac{1}{2^r} E(M_r(U|V))$, (2.11)

which in our setting implies that, for each $n \ge 1$,

$M_r(LC_n(N_1)) \ge \frac{1}{2^r} E(M_r(LC_n(N_1) \,|\, (LC_n(k))_{0 \le k \le 2n}))$. (2.12)

But, $N_1$ is independent of $(LC_n(k))_{0 \le k \le 2n}$, and so from (2.11),

$M_r(LC_n(N_1) \,|\, (LC_n(k))_{0 \le k \le 2n})$
$\ge \frac{1}{2^r} M_r(LC_n(N_1) \,|\, (LC_n(k))_{0 \le k \le 2n}, N_1 \in I) P(N_1 \in I)$, (2.13)

where $I$ is the interval

$I := \left[ 2np_1 - \sqrt{2n(1 - p_1)p_1},\; 2np_1 + \sqrt{2n(1 - p_1)p_1} \right]$. (2.14)

Likewise,

$M_r(LC_n(N_1) \,|\, (LC_n(k))_{0 \le k \le 2n}, N_1 \in I)$
$\ge \frac{1}{2^r} M_r(LC_n(N_1) \,|\, (LC_n(k))_{0 \le k \le 2n}, N_1 \in I \cap O_n) P(O_n)$, (2.15)

where for each $n \ge 1$,

$O_n := \bigcap_{i, j \in I,\; j \ge i + \ell(n)} \left\{ LC_n(j) - LC_n(i) \ge \frac{K_2}{4}(j - i) \right\}$, (2.16)

where $K_2$ is given in Theorem 2.1 and where $\ell(n) > 0$ is to be chosen later.
In other words, on the event $O_n$ the random function $LC_n(\cdot)$ has a slope of
at least $K_2/4$ on the interval $I$, when $i$ and $j$ are at least $\ell(n)$ away from
each other. Combining (2.12), (2.13) and (2.15) leads to

$M_r(LC_n(N_1))$
$\ge \frac{1}{2^{3r}} E(M_r(LC_n(N_1) \,|\, (LC_n(k))_{0 \le k \le 2n}, N_1 \in I \cap O_n)) P(N_1 \in I) P(O_n)$, (2.17)

and it remains to estimate the three terms on the right hand side of (2.17).
For the first one, from the very definition of the event $O_n$, applying Lemma
2.2 and since $N_1$ is independent of $(LC_n(k))_{0 \le k \le 2n}$,

$E(M_r(LC_n(N_1) \,|\, (LC_n(k))_{0 \le k \le 2n}, N_1 \in I \cap O_n))$
$\ge \left( \frac{K_2}{8} \right)^r (M_r(N_1 \,|\, N_1 \in I) - \ell(n)^r)$. (2.18)



Next, from the Berry-Esseen inequality, for all n > 1, 



P(iVi G /) 



'2ti 



e 2 



< 



^2npi(l 



(2.19) 



Moreover, 



lr{Ni\N^ G /) = EdA^i -E(iVi|iVi G I)Y\Ni G /) 

= E(|iVi - 2npi + 2npi - E(iVi|A^i G /)r|A^i G /) 

> |E(|A^i - 2npir|iVi G/)^/"- |2npi - E(iVi|A^i G/)||', 

(2.20) 
and

|E(A^i|iVi el)- 2npi\ 



^/2npi{l -pi) 



E 



/ A^i - 2npi 
\^j2npi{l -pi) 

Fr,{l) - $(1) + - $(-1) - - ^{x))d. 



^{N^ e I) 



<V2np,{l-p,) ^^^^^ 

^ 2 

" J'_^ e-'rdx/v^ - l/v/2npi(l - p,) ' 



(2.21) 



where Fn and $ are respectively the cumulative distribution functions of 
(A^i — 2npi) I ^y2npi(l — pi) and of a standard normal random variable. Like- 
wise, 



E{\Ni-2npi\'\Ni E I) 



> (2npi(l-pi)) 



P(iVi G /) 
r /^i |a;|''e"^(ix - 20r/ A/npi(l -pi) 



(2.22) 



Next, combining ([220]), (ESI} and f lX^ gives: 
M,(Ari|Ari e /) 



> 



{2npi{l -pi))2 



1 / /_;^ ja^Te "a — 2y^/y?2pi(r^^pi) 



e '2 + a/tt/ \/npi(l -pi) 
2 



J -^ e 2 dxl\p2^ — 1/ \/2npi{l — pi) 



(2.23) 



Finally, combining fl2.17p - fl2.23p with the estimate on P(0.„), proved in the 
next lemma, finishes the proof of the lower bound in (11. 3p provided that 
331ogn/K| < ^{n) < K^^Jn and that Theorem 12.11 holds true (i^4 is esti- 
mated in Remark 12. ip . 
Lemma 2.3 Let p2 < K/m, with K = 2 ^10 '^^ , then for all n > 1,

nOn) > 1 - (484v/^e^nexp + 2nexp (-^^^)) ■ (2-24) 

Proof. Let A„ := {{X,Y) G and let A'^ := ^ Then, 



V Vfce/ / / k€l k&I k€l ^ ^ ' 

(2.25) 

First, by Stirling's formula, for all G / and n>l, 



:= 7(A;,n,pi). 

Then for all A; G / and pi > 3/4 (which holds true since p2 < K/m), 

> min (^7 (^2npi - A/2n(l -pi)pi,n,pij ,7 (^2npi + v/2n(l - 

> — pi . (2.26) 



Combining this last inequality with fl2.25p , and using Theorem 12. ![ gives 
P ^ j j ^ 4v^e2nP(A^) < 484^e^nexp (^-^) • (2.27) 



Next, for each n > 1, let 



^^^^^ ,LCn{k + l)-LCn{k), when holds, 28) 
1, otherwise. 



Again, by Theorem 12. 



E(Afc+l|X^r'=) > ^. (2.29) 
Next, for each $k = 0, 1, \dots, 2n$, let $\mathcal{F}_k := \sigma(X^0, Y^0, \dots, X^k, Y^k)$, then
$(\Delta_k - E(\Delta_k|\mathcal{F}_{k-1}), \mathcal{F}_k)_{1 \le k \le 2n}$ forms a sequence of martingale differences, and
since $-1 \le \Delta_k \le 1$, it follows from Hoeffding's martingale inequality that,
for any $i < j$,

P f (A, - E(A,|^._0) < - .)] < exp (-^^^) • 

(2.30) 

Moreover, from (12. 29 p . 

k=i+l 

thus 



\k=i+l J \fc=j+l 

'2 



< exp ( -^^^ ) ■ (2.31) 



For each n > 1, let now 



j>i+£(n) 



then, from fl23T]) 



32 



(2.32) 

From the very definition of Afc in (|2:28|) . H^g/ ^ C and so that 
< 484v^e2nexp (^-^^ + 2nexp {^^^^^^ ■ (2-33) 
Remark 2.1 The reader might wonder how to estimate the constant C in
Theorem \l.l\ From (12.171) . letting n > + and choosing 



1 

1 



£(n) = e 2(rapi(l -pi))2 — 
it follows from fl2:T8|) . fICTD anc? (12:21) that: 

Letting 

Ci = 2-^-^'{r/^ - 1)(1 + r)-^e~^/^K;{l - p^Y^^, 

and 

^ . M,(La) 

C2 = mm J- — , 

then one can choose C = min(Ci,C2) in Theorem \l.l\ 



3 Proof of Theorem 2.1

3.1 Description of alignments 

Let us begin with an example. Let $\mathcal{A}_3 = \{1, 2, 3\}$ and, say, let

$X = 1213131112$, $Y = 1113121112$. (3.1)

One optimal alignment corresponding to the longest common subsequence
(LCS) of $X$ and $Y$ is (with "_" marking a gap)

1 2 _ 1 3 1 3 _ 1 1 1 2
1 _ 1 1 3 1 _ 2 1 1 1 2     (3.2)

while another possible optimal alignment is

1 2 1 _ 3 1 3 _ 1 1 1 2
1 _ 1 1 3 1 _ 2 1 1 1 2     (3.3)

Comparing these two alignments, it is seen that the way the letters $\alpha_1$,
between aligned non-$\alpha_1$ letters, are aligned is not important as long as a
maximal number of such letters $\alpha_1$ are aligned. Hence in general we need only
describe which non-$\alpha_1$ letters are aligned and assume that between pairs of
aligned non-$\alpha_1$ letters a maximal number of letters $\alpha_1$ are aligned. In other
words, we can identify the two alignments (3.2) and (3.3) as the same.

Next, call cells the parts of the alignment between pairs of aligned non-$\alpha_1$
letters. For example, the alignment (3.2) can be decomposed into two cells
$C(1)$ and $C(2)$ as

C(1), v_1 = -1 | C(2), v_2 = 0
1 2 _ 1 3      | 1 3 _ 1 1 1 2
1 _ 1 1 3      | 1 _ 2 1 1 1 2     (3.4)

where, moreover, $v_i$ denotes the difference between the number of letters $\alpha_1$
in the $X$-strand and the $Y$-strand of the cell $C(i)$. Note that any alignment
can be represented as a finite vector of such differences. For the alignment
(3.2), this gives the representation $(v_1, v_2) = (-1, 0)$. The same $X$ and
$Y$ can have different optimal alignments and thus different representations. For
example, above, another optimal representation is via $(v_1, v_2) = (0, -1)$:

C(1), v_1 = 0 | C(2), v_2 = -1
1 2 1 3 1 3   | 1 _ 1 1 _ 2
1 _ 1 _ 1 3   | 1 2 1 1 1 2     (3.5)

Let $X = X_1 X_2 \cdots X_n$ and $Y = Y_1 Y_2 \cdots Y_n$ be given. As just explained, to
every optimal alignment corresponds a vector representation $v := (v_1, \dots, v_k)$
showing the number of cells ($k$, here) in the alignment and the differences
between the numbers of letters $\alpha_1$ in the $X$-strand and the $Y$-strand of the cells.
In every cell, the maximum number of letters $\alpha_1$ is aligned. On the other
hand, to every $v = (v_1, \dots, v_k) \in \mathbb{Z}^k$ corresponds a (possibly empty) family
of alignments. All of these alignments have the same pairs of aligned non-$\alpha_1$
letters and between consecutive pairs of aligned non-$\alpha_1$ letters, a maximal
number of letters $\alpha_1$ are aligned. Since the alignments corresponding to the
same $v$ can only differ in the way the letters $\alpha_1$ are aligned inside the cells, we
again identify all the alignments in the family associated with $v$ as a single
alignment. In other words, we identify each vector $v$ with an alignment and
vice-versa.

Writing $|v|$ for the number of coordinates of $v$, i.e., $|v| = k$, if $v \in \mathbb{Z}^k$,
the alignment associated with $v = (v_1, \dots, v_k) \in \mathbb{Z}^k$ can now precisely be
defined:
Definition 3.1 Let $k \in \mathbb{N}$ and let $v = (v_1, \dots, v_k) \in \mathbb{Z}^k$. Let $\pi_v(0) =
\nu_v(0) = 0$, and for $0 \le i \le k - 1$, let $(\pi_v(i + 1), \nu_v(i + 1))$ be the smallest
$(s, t)$ (where $(s_1, t_1) \le (s_2, t_2)$ indicates that $s_1 \le s_2$ and $t_1 \le t_2$) such that
the following three conditions are satisfied.

1. $\pi_v(i) < s$ and $\nu_v(i) < t$;

2. $X_s = Y_t \in \{\alpha_2, \dots, \alpha_m\}$;

3. the difference between the number of letters $\alpha_1$ in the $X$-interval $[\pi_v(i), s]$
and in the $Y$-interval $[\nu_v(i), t]$ is equal to $v_{i+1}$.

If no such $(s, t)$ exists, then set $\pi_v(i + 1) = \cdots = \pi_v(k) = \infty$ and $\nu_v(i + 1) =
\cdots = \nu_v(k) = \infty$.

In other words, above, $\pi_v(i), \nu_v(i)$ are the indices corresponding to the
$i$-th aligned non-$\alpha_1$ pair in $v$. The $i$-th cell $C_v(i)$ is the pair

$(X_{\pi_v(i-1)+1} \cdots X_{\pi_v(i)},\; Y_{\nu_v(i-1)+1} \cdots Y_{\nu_v(i)})$,

and the cell $C_v(i)$ is called a $v_i$-cell.

With the above definition, we can then let the alignment $v$ be any alignment
(provided one exists) such that the following three conditions hold:

1. $X_{\pi_v(i)}$ is aligned with $Y_{\nu_v(i)}$, for every $i = 1, 2, \dots, k$;

2. the number of aligned $\alpha_1$ in the cell $C_v(i)$, denoted by $S_v(i)$, is the
minimum number of letters $\alpha_1$ present in either $X_{\pi_v(i-1)+1} \cdots X_{\pi_v(i)}$
or $Y_{\nu_v(i-1)+1} \cdots Y_{\nu_v(i)}$;

3. after aligning $X_{\pi_v(k)}$ with $Y_{\nu_v(k)}$, align as many $\alpha_1$ as possible, and let
that number be $r_v$.

From these definitions, for any $v \in \mathbb{Z}^k$, if an alignment corresponding to
$v$ exists, then $\pi_v(k) \le n$ and $\nu_v(k) \le n$, and such a $v$ is called admissible.
Let $V$ denote the set of all admissible alignments, that is,

$V := \bigcup_{k} \{ v \in \mathbb{Z}^k : \pi_v(k) \le n,\; \nu_v(k) \le n \}$. (3.6)

Then for every $v \in V$, the length of the common subsequence corresponding
to this alignment is:

$AC_v = |v| + \sum_{i=1}^{|v|} S_v(i) + r_v$. (3.7)

Therefore the length of the longest common subsequence of $X$ and $Y$ can be
expressed as:

$LC_n = \max_{v \in V} AC_v$, (3.8)

and an admissible alignment is optimal if and only if $AC_v = LC_n$.
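The construction of Definition 3.1 and the length formula (3.7) can be sketched in code. The function below is only an illustration under stated assumptions: words are strings, the letter $\alpha_1$ is the character "1", and, when several pairs $(s, t)$ qualify, the scan takes the first one found by increasing $s$ and then increasing $t$, which is our own tie-breaking choice. The name `alignment_length` is hypothetical.

```python
def alignment_length(x, y, v, a1="1"):
    """Return AC_v for the vector v, or None if v is not admissible.

    Each cell ends with the first pair (s, t) of equal non-a1 letters whose
    counts of a1 inside the cell differ by the prescribed amount.
    """
    pi = nu = 0  # 1-based positions of the last aligned non-a1 pair
    total = 0
    for target in v:
        found = None
        ones_x = 0  # letters a1 seen on the X-strand of the current cell
        for s in range(pi + 1, len(x) + 1):
            if x[s - 1] == a1:
                ones_x += 1
                continue
            ones_y = 0  # letters a1 seen on the Y-strand of the current cell
            for t in range(nu + 1, len(y) + 1):
                if y[t - 1] == a1:
                    ones_y += 1
                elif y[t - 1] == x[s - 1] and ones_x - ones_y == target:
                    found = (s, t, min(ones_x, ones_y))
                    break
            if found is not None:
                break
        if found is None:
            return None  # no such cell exists: v is not admissible
        pi, nu, aligned_ones = found
        total += 1 + aligned_ones  # the aligned non-a1 pair plus S_v(i)
    # r_v: letters a1 that can still be aligned after the last cell
    tail = min(x[pi:].count(a1), y[nu:].count(a1))
    return total + tail


if __name__ == "__main__":
    print(alignment_length("1213131112", "1113121112", (-1, 0)))
    print(alignment_length("1213131112", "1113121112", (0, -1)))
```

On the words of (3.1), the two vectors $(-1, 0)$ and $(0, -1)$ yield the same length, in line with the two optimal representations discussed above.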



3.2 The effect of changing a non-$\alpha_1$ letter to $\alpha_1$

Again, the main idea behind Theorem 2.1 is that, by changing a randomly
picked non-$\alpha_1$ letter into $\alpha_1$, the length of the longest common subsequence
is more likely to increase by one than to decrease by one. More precisely,
conditional on the event $A_n = \{(X, Y) \in B_n\}$, the probability of an increase
of $LC_n$ is at least $K_2$ while the probability of a decrease is at most $K_2/2$.
Let us illustrate this fact with an example. Let $X$ and $Y$ be given by,

$X = 112113112131$, $Y = 131111111131$, (3.9)

with optimal alignment (with "_" marking a gap):

C(1), v_1 = -2
1 _ 1 2 1 1 3 1 1 2 1 _ _ 3 | 1
1 3 1 _ 1 1 _ 1 1 _ 1 1 1 3 | 1     (3.10)

Above, there are 6 non-$\alpha_1$ letters, $X_3, X_6, X_9, X_{11}, Y_2, Y_{11}$, and each one has
probability 1/6 to be picked and replaced by $\alpha_1$. Next, $X_3, X_6, X_9$ and $Y_2$
are not aligned. Moreover, since $X_3, X_6, X_9$ are on the top strand which
contains a lesser number of letters $\alpha_1$, picking one of them and replacing it
leads to an increase of one in the length of the LCS. On the other hand,
since $X_{11}$ and $Y_{11}$ are aligned in this optimal alignment, picking one of them
could potentially (but not necessarily) decrease the length of the LCS by
one. Finally, picking $Y_2$ may only potentially increase the length of the LCS
by modifying the alignment. In conclusion, in this example, by switching a
randomly chosen non-$\alpha_1$ letter into $\alpha_1$, the probability of an increase of the
length of the LCS is at least 1/2, while the probability of a decrease is at
most 1/3.
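The example can be checked by direct computation. The sketch below (helper names are ours, and the letter $\alpha_1$ is assumed to be the character "1") replaces each non-$\alpha_1$ letter of the words in (3.9) in turn and records the resulting change in the length of the LCS.

```python
def lcs_length(x, y):
    """Length of a longest common subsequence of the sequences x and y."""
    prev = [0] * (len(y) + 1)
    for a in x:
        cur = [0]
        for j, b in enumerate(y, start=1):
            cur.append(prev[j - 1] + 1 if a == b else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def change_effects(x, y, a1="1"):
    """LCS length of (x, y) and its change when one non-a1 letter becomes a1.

    The changes are listed for the non-a1 letters of x first, then of y,
    each in order of position.
    """
    base = lcs_length(x, y)
    effects = []
    for pos, letter in enumerate(x):
        if letter != a1:
            changed = x[:pos] + a1 + x[pos + 1:]
            effects.append(lcs_length(changed, y) - base)
    for pos, letter in enumerate(y):
        if letter != a1:
            changed = y[:pos] + a1 + y[pos + 1:]
            effects.append(lcs_length(x, changed) - base)
    return base, effects


if __name__ == "__main__":
    base, effects = change_effects("112113112131", "131111111131")
    increase = effects.count(1) / len(effects)
    decrease = effects.count(-1) / len(effects)
    print(base, effects, increase, decrease)
```

Each of the six candidate letters is equally likely to be picked, so the empirical frequencies of +1 and -1 in `effects` are exactly the probabilities of an increase and of a decrease in this example.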

To prove Theorem 2.1 we just need to prove that typically there exists
an optimal alignment $v$ such that:

1. Among all the non-$\alpha_1$ letters in $X$ and $Y$, the proportion which are on
the cell-strand with a smaller number of letters $\alpha_1$ is at least $K_2$.

2. Among all the non-$\alpha_1$ letters in $X$ and $Y$, the proportion which is
aligned is at most $K_2/2$.

Formally, let $v = (v_1, \dots, v_k) \in \mathbb{Z}^k$ be admissible. For each $1 \le i \le k$, if
$v_i \ne 0$, let $N_v^-(i)$ be the number of non-$\alpha_1$ letters on the cell-strand of $C_v(i)$
with a lesser number of letters $\alpha_1$.



K{^ = {^B^^''^'^^''^''---^''-'^ (3.11) 



while if $v_i = 0$, let $N_v^-(i) = 0$. Then the total number of non-$\alpha_1$ letters on
the cell-strand with the smaller number of letters $\alpha_1$ is

$N_v^- := \sum_{i=1}^{|v|} N_v^-(i)$. (3.12)

Let $N_i$ be the number of letters $\alpha_i$ in the two finite sequences $X$ and $Y$, and
let

$N_{>1} = \sum_{i=2}^{m} N_i$. (3.13)

Next, let

$B_n := \{(x, y) \in \mathcal{A}_m^n \times \mathcal{A}_m^n$ : there exists an optimal alignment $v$ of $(x, y)$
with $n^- \ge K_2 n_{>1}$ and $2|v| \le K_2 n_{>1}/2\}$,

where, above, $n^-$ is the value of $N_v^-$ corresponding to $v$ and similarly for $n_{>1}$.
Clearly, $B_n$ depends on $K_2$. Letting $A_n = \{(X, Y) \in B_n\}$, our goal is now to
prove that for some $K_2 > 0$,

$P(A_n) \ge 1 - e^{-K_3 n}$,

where $K_3 > 0$ is independent of $n$.

To continue our proof, we need an optimal alignment with enough non-$\alpha_1$
letters in the cell-strands with a smaller number of letters $\alpha_1$. However,
for many optimal alignments, most cells are 0-cells, i.e., cells with the same
number of letters $\alpha_1$ on both strands. To resolve this hurdle, on an optimal
alignment where most cells are 0-cells, some of the 0-cells are broken up in
order to create enough nonzero-cells while at the same time maintaining the
optimality of the alignment after this breaking operation. Let us present an
example. Take two sequences

$X = 1121131123$, $Y = 112131113$,

one of their optimal alignments is (with "_" marking a gap)

C(1), v_1 = 0 | C(2), v_2 = 0
1 1 2         | 1 _ 1 3 1 1 2 3
1 1 2         | 1 3 1 _ 1 1 _ 3     (3.14)

where both cells $C(1)$ and $C(2)$ are 0-cells. Now in cell $C(2)$, $X_6$ and $Y_5$ are
only one position away from being aligned. Thus aligning them, instead of
the pair $X_5$ and $Y_6$, breaks the cell $C(2)$ into two new cells $C(2)$ and $C(3)$,
with $v_2 = 1$ and $v_3 = -1$. The new optimal alignment is then:

C(1), v_1 = 0 | C(2), v_2 = 1 | C(3), v_3 = -1
1 1 2         | 1 1 3         | 1 1 _ 2 3
1 1 2         | 1 _ 3         | 1 1 1 _ 3     (3.15)

The advantage of breaking up a 0-cell is that the newly formed cells have
different numbers of letters $\alpha_1$ on each strand, thus $N_v^-$ tends to increase in
this process while the length of the common subsequence remains the same.
After applying this operation and getting enough cells with different numbers
of letters $\alpha_1$ on the two strands, there is a high probability to find enough
non-$\alpha_1$ letters on the strand with a smaller number of letters $\alpha_1$.

The previous example leads to our next definition.

Definition 3.2 Let $k \in \mathbb{N}$, $v \in \mathbb{Z}^k \cap V$. Let $C_v(i)$ be any cell with $v_i = 0$,
$1 \le i \le k$. Then, $C_v(i)$ is said to be breakable if there exist $j$ and $j'$ such
that:

1. $X_j = Y_{j'} \in \{\alpha_2, \dots, \alpha_m\}$;

2. $\pi_v(i - 1) < j < \pi_v(i)$ and $\nu_v(i - 1) < j' < \nu_v(i)$;

3. the difference between the number of letters $\alpha_1$ in
$X_{\pi_v(i-1)+1} \cdots X_j$ and in $Y_{\nu_v(i-1)+1} \cdots Y_{j'}$
is plus or minus one.
3.3 Probabilistic developments 

After the combinatorial analysis of the previous sections, let us now bring in
some probabilistic tools. We start by introducing a useful way of constructing
alignments corresponding to a given vector $v = (v_1, \dots, v_k) \in \mathbb{Z}^k$.

For $1 \le i \le n$ and $2 \le j \le m$, let $R_i^j$ (resp. $S_i^j$) be the number of
letters $\alpha_j$ between the $(i - 1)$-th $\alpha_1$ and the $i$-th $\alpha_1$ in the infinite sequence
$(X_i)_{i \ge 1}$ (resp. $(Y_i)_{i \ge 1}$), with the convention that $R_1^j$ (resp. $S_1^j$) is the number
of letters $\alpha_j$ before the first $\alpha_1$.

Recall also from Definition 3.1 that to construct a 0-cell, we use the
random time $T_0$, where

$T_0 = \min_{2 \le j \le m} T_0^j$, (3.16)

where $T_0^j := \min\{i = 1, 2, \dots : R_i^j \ne 0, S_i^j \ne 0\}$. For a $-u$-cell ($u > 0$), the
random time

$T_{-u} = \min_{2 \le j \le m} T_{-u}^j$, (3.17)

where $T_{-u}^j := \min\{i = 1, 2, \dots : R_i^j \ne 0, S_{i+u}^j \ne 0\}$, and for a $u$-cell ($u > 0$),

$T_u = \min_{2 \le j \le m} T_u^j$, (3.18)

where $T_u^j := \min\{i = 1, 2, \dots : R_{i+u}^j \ne 0, S_i^j \ne 0\}$. In other words, a cell with
$v_i = u$ can be constructed in the following way: First keep the first $u$ letters
$\alpha_1$ in the $X$ strand, then align consecutive pairs of $\alpha_1$ until meeting the first
pair of non-$\alpha_1$ letters.

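Let us look at the distribution of $R_i^j$, and to do so, let

$R_i^{>1} := \sum_{j=2}^{m} R_i^j$

be the total number of non-$\alpha_1$ letters between the $(i - 1)$-th $\alpha_1$ and $i$-th $\alpha_1$.
Then, $R_i^{>1} + 1$ has a geometric distribution with parameter $p_1$:

$P(R_i^{>1} = k) = (1 - p_1)^k p_1$,

$k = 0, 1, 2, \dots$. Moreover, conditional on $R_i^{>1}$, $(R_i^j)_{j=2}^{m}$ has a multinomial
distribution and therefore

$P(R_i^j = k) = \sum_{\ell=k}^{\infty} P(R_i^j = k \,|\, R_i^{>1} = \ell) P(R_i^{>1} = \ell)$
$= \sum_{\ell=k}^{\infty} \binom{\ell}{k} \left( \frac{p_j}{1 - p_1} \right)^k \left( 1 - \frac{p_j}{1 - p_1} \right)^{\ell - k} (1 - p_1)^\ell p_1$
$= \left( \frac{p_j}{p_1 + p_j} \right)^k \frac{p_1}{p_1 + p_j}$, (3.19)

for $k = 0, 1, 2, \dots$. Thus, $R_i^j + 1$ has a geometric distribution with parameter
$p_1/(p_1 + p_j)$, $2 \le j \le m$.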
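The geometric form of the law of $R_i^j$ can be confirmed numerically. The sketch below (function names and the truncation level `terms` are our own choices) compares a truncated version of the mixture over the total number of non-$\alpha_1$ letters with the closed-form geometric expression.

```python
from math import comb, isclose


def marginal_by_summation(p1, pj, k, terms=400):
    """Truncated sum over l of P(R^j = k | R^{>1} = l) * P(R^{>1} = l)."""
    q = pj / (1.0 - p1)  # chance that a given non-a1 letter is alpha_j
    total = 0.0
    for l in range(k, k + terms):
        binomial_part = comb(l, k) * q ** k * (1.0 - q) ** (l - k)
        geometric_part = (1.0 - p1) ** l * p1
        total += binomial_part * geometric_part
    return total


def marginal_closed_form(p1, pj, k):
    """Closed form: R^j + 1 is geometric with parameter p1 / (p1 + pj)."""
    return (pj / (p1 + pj)) ** k * p1 / (p1 + pj)


if __name__ == "__main__":
    for k in range(5):
        print(k, marginal_by_summation(0.6, 0.1, k), marginal_closed_form(0.6, 0.1, k))
```

The terms of the series decay geometrically with ratio $1 - p_1 - p_j$, so a few hundred terms are far more than enough for the parameter values used here.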

We continue our probabilistic analysis by providing a rough lower bound
for the length of the LCS. First, aligning as many letters $\alpha_1$ as possible in $X$
and $Y$ would give approximately a common subsequence of length $np_1$, then
aligning as many letters $\alpha_2$ as possible without disturbing the already aligned
$\alpha_1$ would give an additional $\sum_i \min\{R_i^2, S_i^2\}$ aligned $\alpha_2$. Moreover, since
$R_i^2$ and $S_i^2$ are independent geometric random variables, $\min\{R_i^2, S_i^2\} + 1$ is a
geometric random variable with parameter $1 - (p_2/(p_1 + p_2))^2$. So on average
the aligned letters $\alpha_2$ contribute to the length of the LCS:

$np_1 \cdot \frac{p_2^2}{p_1(p_1 + 2p_2)} = \frac{n p_2^2}{p_1 + 2p_2}$.

This heuristic argument leads to the following rigorous lemma:

Lemma 3.1 Letpi > 1/2, and let Di := {LCn > npi + ((1 — ^2)'^ ~ P2) np\}. 
Then, 

P(Di) > 1 - 4exp(-2np^) - exp {n{pl + log(l - pl)){p^ - pD) . 
Proof. For 5 > 0, let 



Dm ■■- 



1=1 



< 6n 
^l{Y,=a,} - npi 



< 5n 



i=l 



and let 

D,{5) ■.= D^,i5)nDl{6). 

Therefore on D2{S), at least ni{6) := n{pi — 6) letters ai can be aligned. 
Now, min{i?f , Sf} + 1 has a geometric distribution with parameter 1 — 
{p^/ipi +^2))^- But, if Qi, - ■ ■ ,Gn are iid geometric random variables with 
parameter p, then for any P < 1, 



P ( < ) < exp {-{(3 - 1 - log(3)n) 



(3.20) 



.1=1 



and so for any (3 < 1, 



P 



Let 



(5) 



min{/?^ S^} 



< 



\ 



1=1 



P2 



P1+P2 



< e-C/^-i-iog/?)-!^. (3.21) 



ni(5) 



Ds{P,5) := { 5] min{i?^ 5f} > 



i=l 



P2 



P1+P2 



Choosing 5 = p\ and [i = 1 — p\, and when D2{6) and D-^{P,6) hold. 



LCn > 



P2 



P1+P2 

ni(5) 



P2 



PI+P2 



ni{5) + ni{5) 



ni{6)pl 



P2 



2 PI-P2 

np2- 



'{pi +P2y -pI 

> npi + ((1 - p2f - P2) npl 
By Hoeff ding's inequality, for any 5 > 0, 



+ n{pi - pI) - npl 



P1+P2 ^ 

P2 



P1+P2 



P (^2 (^)) > 1 - 2e-2"''' , P > 1 - 2e 



-2n(52 
but since D2{pl) fl ^3(1 — pi, pi) C Di, it follows that 

FiD,) > 1 -Aexp{-2npl) - exp + log(l - - pD) . 

■ 

To state our next lemma, let us introduce some more notation. First, let 
V{k) := {{vi,V2r-- ,Vk) G l^^il + ■ ■ ■ + < 2k} , (3.22) 

then recalling that V as defined in (13. 6p is the set of admissible alignments, 
let 

P:= U [vf^V{k)). (3.23) 

2k>np^ 

With these definitions, the previous lemma further yields: 

Lemma 3.2 Let D = {v E P : v encodes an optimal alignment} , and let 
P2 < 1/10, then 

F{D) > 1 - 5exp (^-^^ 

Proof. Let be the number of letters ai in X = (Xj)i<j<„, and Nf be the 
corresponding number in y = (l^i)i<i<n, so that A^^i = A'^f + iVf is the number 
of letters ai in X and Y. From the proof of the previous lemma, we see that 
when .02(^2) holds, A^i is upper bounded by 2n(pi +pl) and therefore on Di, 
and provided that p2 < 1/10, 

LCn>^- npl + ((1 - p2f - P2) npl>^ + Kipl (3.24) 
Now assume that v = {vi, - ■ ■ ,Vk) G is an optimal alignment, then 

TV 1 ^ 

LC^<^--Y,h\ + k, (3.25) 

1=1 

which when combined with fl3.24p yields 

k 

\vi\ < 2/c and npl < 2k, 

i=l 

which finishes the proof. ■ 

The above lemma indicates that, with high probability, any optimal align- 
ment belongs to the set P. Hence, proving a property of the optimal align- 
ment essentially only requires proving it for alignments in P. 
3.4 Events with high probability

Recall from Definition 3.1 that to $v \in \mathbb{Z}^k$ is associated an alignment which
has $|v|$ cells $C_v(1), \dots, C_v(|v|)$, and that a cell is called a nonzero-cell if it
contains different numbers of letters $\alpha_1$ on the $X$ strand and $Y$ strand. Let
$W$ be the subset of $P$ consisting of the alignments for which the proportion
of the nonzero-cells is at least $\theta$, i.e.,

$W := \{v \in P : |\{i \in [1, |v|] : v_i \ne 0\}| \ge \theta |v|\}$,

and let $W^c := P \setminus W$.

Let us now define some relevant events, which are of great use to finish
our proof.

• Let $E_v$ be the event that, among the zero-cells in $C_v(1), \dots, C_v(|v|)$,
the proportion which are breakable is at least $\theta$. Then, let

$E := \bigcap_{v \in W^c} E_v$,

i.e., $E$ is the event that for all $v \in W^c$, the proportion of breakable
zero-cells is at least $\theta$.

• Recall also from (3.12) and (3.13), that $N_v^-$ is the number of non-$\alpha_1$
letters on the cell strands with a lesser number of $\alpha_1$, and that $N_{>1}$ is
the total number of non-$\alpha_1$ letters in $X$ and $Y$. Let

$F := \bigcap_{v \in W} F_v := \bigcap_{v \in W} \{N_v^- \ge K_2 N_{>1}\}$,

i.e., $F$ is the event that for every $v \in W$, the proportion of non-$\alpha_1$
letters which are on the cell-strand with the smaller number of letters
$\alpha_1$, is at least $K_2$.

• Let

$G := \bigcap_{v \in W} G_v := \bigcap_{v \in W} \{2|v| \le K_2 N_{>1}/2\}$,

i.e., $G$ is the event that for every $v \in W$, the proportion of non-$\alpha_1$
letters which are aligned is at most $K_2/2$.

Recall finally from Section 3.2 that $A_n = \{(X, Y) \in B_n\}$ is the event
that there exists an optimal alignment $v$ such that $N_v^- \ge K_2 N_{>1}$ and $2|v| \le
K_2 N_{>1}/2$, and therefore

$D \cap E \cap F \cap G \subset A_n$. (3.26)

Our next task is to prove that there exists $K_2 > 0$ such that the events
$E$, $F$, $G$ hold with high probability. Let us start with $E$.
Lemma 3.3 For any < 9 < 1, 
F{E)>1- exp(-(2(l-0)(^^-^) -log/(^)|A:|, (3.27) 



2fc>np2 

where 

m 



6^ \ 2 \i-e 



Proof. For any $v \in W^c$, let us compute the probability that a 0-cell in the alignment associated with $v$ is breakable. Recalling the definition of $T_0$ in (3.16), for $2 \le j \le m$, let $M_j$ be the event that this cell ends with a pair of letters $\alpha_j$, and so when $M_j$ holds $T_0 = T_0^j$. For $2 \le j \le m$, let

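$$U_1^j := \min\{i = 2, 3, \dots : R^j_{i-1} \neq 0,\ S^j_{i-1} = 0,\ R^j_i = 0,\ S^j_i \neq 0\},$$

$$U_2^j := \min\{i = 2, 3, \dots : R^j_{i-1} = 0,\ S^j_{i-1} \neq 0,\ R^j_i \neq 0,\ S^j_i = 0\},$$

$$U^j := \min\{U_1^j, U_2^j\}.$$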

With the above constructions, conditional on the event $M_j$, if $U^j < T_0^j$ then this 0-cell is breakable. Thus, to lower bound the probability that this 0-cell is breakable, it is enough to lower bound $\mathbb{P}(U^j < T_0^j)$. To do so, let first $(Z_i^j)_{i=1,2,\dots}$ be the independent random vectors given via:

$$Z_i^j := (R^j_{2i-1},\ S^j_{2i-1},\ R^j_{2i},\ S^j_{2i}).$$

Then, let

$$\tilde U^j = \min\{i = 1, 2, \dots : Z_i^j \in B_1 \cup B_2\},$$

$$\tilde T_0^j = \min\{i = 1, 2, \dots : Z_i^j \in B_3 \cup B_4\},$$
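where

$$B_1 := \mathbb{N}^* \times \{0\} \times \{0\} \times \mathbb{N}^*, \qquad B_2 := \{0\} \times \mathbb{N}^* \times \mathbb{N}^* \times \{0\},$$

$$B_3 := \mathbb{N}^* \times \mathbb{N}^* \times \mathbb{N} \times \mathbb{N}, \qquad B_4 := \mathbb{N} \times \mathbb{N} \times \mathbb{N}^* \times \mathbb{N}^*,$$

and where as usual $\mathbb{N}$ is the set of nonnegative integers, while $\mathbb{N}^* = \mathbb{N} \setminus \{0\}$.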
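The four sets above are simple sign patterns on 4-vectors of nonnegative integers, and the two indices defined just before them are first-hitting indices. The sketch below spells this out in Python; the helper names and the sample sequence `zs` are hypothetical and serve only as an illustration.

```python
def in_B1(z):
    a, b, c, d = z
    return a > 0 and b == 0 and c == 0 and d > 0


def in_B2(z):
    a, b, c, d = z
    return a == 0 and b > 0 and c > 0 and d == 0


def in_B3(z):
    a, b, _, _ = z
    return a > 0 and b > 0


def in_B4(z):
    _, _, c, d = z
    return c > 0 and d > 0


def first_index(zs, predicate):
    """1-based index of the first vector satisfying predicate, or None."""
    for i, z in enumerate(zs, start=1):
        if predicate(z):
            return i
    return None


# Hypothetical sample sequence of 4-vectors.
zs = [(0, 0, 2, 0), (1, 0, 0, 3), (2, 1, 0, 0)]
u_tilde = first_index(zs, lambda z: in_B1(z) or in_B2(z))
t_tilde = first_index(zs, lambda z: in_B3(z) or in_B4(z))
print(u_tilde, t_tilde)  # 2 3
```

In this sample the union of the first two sets is hit strictly before the union of the last two, which is the favorable ordering whose probability is bounded below in the proof.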
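Clearly,

$$2\tilde U^j \ge U^j, \qquad 2\tilde T_0^j - 1 \le T_0^j,$$

thus

$$\mathbb{P}(U^j < T_0^j) \ge \mathbb{P}(2\tilde U^j < 2\tilde T_0^j - 1) = \mathbb{P}(\tilde U^j < \tilde T_0^j).$$

Since the random variables $(Z_i^j)_{i \in \mathbb{N}^*}$ are iid, and since $B_1 \cup B_2$ and $B_3 \cup B_4$ are disjoint,

$$\mathbb{P}(\tilde U^j < \tilde T_0^j) = \frac{\mathbb{P}(Z_1^j \in B_1 \cup B_2)}{\mathbb{P}(Z_1^j \in B_1 \cup B_2) + \mathbb{P}(Z_1^j \in B_3 \cup B_4)} \ge \frac{p_1^2}{1 + p_1^2}.$$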

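Therefore,

$$\mathbb{P}(\text{a 0-cell is breakable}) = \sum_{j=2}^{m} \mathbb{P}(\text{a 0-cell is breakable} \mid M_j)\,\mathbb{P}(M_j) \ge \sum_{j=2}^{m} \frac{p_1^2}{1 + p_1^2}\,\mathbb{P}(M_j) = \frac{p_1^2}{1 + p_1^2}.$$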

Let $J$ be the index set of 0-cells in the alignment associated with $v \in W^c$, and so $|J| \ge (1-\theta)|v|$. For each $i \in J$, let $J_i$ be the Bernoulli random variable which is one if the cell $C_v(i)$ is breakable and zero otherwise. Recall that $E_v$ is the event that the proportion of the breakable cells in $v$ is at least $\theta$. Then from Hoeffding's inequality,

$$\mathbb{P}(E_v^c) \le \mathbb{P}\left(\sum_{i \in J} J_i - \sum_{i \in J} \mathbb{E}J_i < -\left(\sum_{i \in J} \mathbb{E}J_i - \theta|J|\right)\right) \le \exp\left(-2(1-\theta)|v|\left(\frac{p_1^2}{1 + p_1^2} - \theta\right)^2\right).$$
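To make the use of Hoeffding's inequality concrete, the following Python sketch compares the exact lower tail of a binomial sum with the bound $\exp(-2n(p-\theta)^2)$. It assumes, as a simplification, iid Bernoulli variables with a common success probability `p`; the function names and the numerical values are hypothetical.

```python
from math import comb, exp


def lower_tail(n, p, threshold):
    """Exact P(S < threshold) for S ~ Binomial(n, p)."""
    return sum(
        comb(n, s) * p**s * (1 - p) ** (n - s)
        for s in range(n + 1)
        if s < threshold
    )


def hoeffding_bound(n, p, theta):
    """Hoeffding bound exp(-2 n (p - theta)^2) on P(S < theta * n), for theta < p."""
    return exp(-2 * n * (p - theta) ** 2)


# Hypothetical values: n cells, success probability p, threshold proportion theta.
n, p, theta = 50, 0.5, 0.04
exact = lower_tail(n, p, theta * n)
bound = hoeffding_bound(n, p, theta)
print(exact <= bound)
```

In the proof the success probabilities are only bounded below, so the deviation from the mean can only be larger than in this iid simplification.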

Recall the definition of $V(k)$ in (3.22), and let

$$W^c(k) := W^c \cap V(k).$$

From Stirling's formula, for any two integers $\ell$ and $q\ell$, with $0 < q < 1$,

$$\binom{\ell}{q\ell} \le q^{-q\ell}(1-q)^{-(1-q)\ell}, \tag{3.28}$$

which when combined with simple estimates yields



9^ \ 2 / Vl-^ 



(3.29) 
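The binomial estimate (3.28) can be checked numerically for small parameters. The Python sketch below does so by brute force; the function name `entropy_bound` and the tested range are arbitrary choices made for illustration.

```python
from math import comb


def entropy_bound(l, m):
    """Right-hand side of (3.28) with q = m / l, for integers 0 < m < l."""
    q = m / l
    return q ** (-m) * (1 - q) ** (-(l - m))


# Brute-force check over a small, arbitrarily chosen range of parameters.
checks = [(l, m) for l in range(2, 60) for m in range(1, l)]
print(all(comb(l, m) <= entropy_bound(l, m) for l, m in checks))

# With l = 3k and q = 1/3 the bound equals (27/4)^k; here k = 1.
print(entropy_bound(3, 1))
```

The inequality is strict for $0 < q < 1$, since $\binom{\ell}{q\ell} q^{q\ell}(1-q)^{(1-q)\ell}$ is a single term of a binomial expansion summing to one.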



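Let

$$E(k) = \bigcap_{v \in W^c(k)} E_v,$$

then

$$\mathbb{P}((E(k))^c) \le \sum_{v \in W^c(k)} \mathbb{P}(E_v^c) \le \exp\left(-\left(2(1-\theta)\left(\frac{p_1^2}{1 + p_1^2} - \theta\right)^2 - \log f(\theta)\right)k\right),$$

which further gives

$$\mathbb{P}(E^c) \le \sum_{2k \ge np_2^2} \exp\left(-\left(2(1-\theta)\left(\frac{p_1^2}{1 + p_1^2} - \theta\right)^2 - \log f(\theta)\right)k\right).$$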
Of course, above, one wants

$$2(1-\theta)\left(\frac{p_1^2}{1 + p_1^2} - \theta\right)^2 - \log f(\theta) > 0, \tag{3.30}$$

and choices for which this is satisfied are given later.

Let $u$ be a nonnegative integer. For any $(-u)$-cell ending with an aligned pair of letters $\alpha_j$ (the event $M_j$ holds for this cell), let $r^j(l)$ be the index of the $l$-th $R_i^j$ such that $R_i^j \neq 0$, i.e.,

$$r^j(1) = \min\{i \ge 1 : R_i^j \neq 0\},$$

and for any $l \ge 1$, $r^j(l+1) = \min\{i > r^j(l) : R_i^j \neq 0\}$.
Let 

p^--:=min{/ = l,2,...: S^,^,^ 7^ 0}. 

Hence $\rho^{j,-u}$ is the number of nonzero values taken by $R^j = (R_i^j)_{1 \le i \le s}$ (where $s$ is the number of letters $\alpha_1$ in the $X$-strand) in the cell (including the last one corresponding to the aligned pair of letters $\alpha_j$). Since $X$ and $Y$ are independent,

$$\mathbb{P}(\rho^{j,-u} = k) = \left(\frac{p_1}{p_1 + p_j}\right)^{k-1} \frac{p_j}{p_1 + p_j}, \tag{3.31}$$

for $k = 1, 2, \dots$. Thus, $\rho^{j,-u}$ has a geometric distribution with parameter $\tilde p_j := p_j/(p_1 + p_j)$, $2 \le j \le m$. When $-u < 0$, the number of letters $\alpha_j$ in the $X$-strand (which is the strand with the smaller number of letters $\alpha_1$) is at least $\rho^{j,-u} - 1$ and, as shown in the next result, this provides a lower bound for $N^-$ (the number of non-$\alpha_1$ letters on the cell-strand with the lesser number of letters $\alpha_1$) in this $(-u)$-cell.
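As a quick sanity check of the geometric law in (3.31), the Python sketch below evaluates the probability mass function with parameter $p_j/(p_1 + p_j)$ and verifies numerically that it sums to one with mean $(p_1 + p_j)/p_j$. The function name `rho_pmf`, the letter probabilities and the truncation level are hypothetical.

```python
def rho_pmf(k, p1, pj):
    """Geometric mass function (p1/(p1+pj))**(k-1) * pj/(p1+pj), for k >= 1."""
    success = pj / (p1 + pj)
    return (1 - success) ** (k - 1) * success


# Hypothetical letter probabilities, with p1 dominant as in the setting of the text.
p1, pj = 0.9, 0.05
ks = range(1, 2000)  # truncation level chosen so that the neglected tail is negligible
total = sum(rho_pmf(k, p1, pj) for k in ks)
mean = sum(k * rho_pmf(k, p1, pj) for k in ks)
print(total, mean, (p1 + pj) / pj)
```

The smaller $p_j$ is relative to $p_1$, the larger the mean, which is why cells tend to contain many letters $\alpha_1$ before they close.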

Lemma 3.4 Let K2 = Ki/m, where Ki = 2~^3e~^^, then for any pi > 
1 _ 2-2e-67, 

$$\mathbb{P}(F) \ge 1 - 38\exp(-3np_2^2/200).$$
Proof. For any $v \in W$, let now $J$ be the index set of the nonzero cells of the alignment corresponding to $v$. Hence, $|J| \ge \theta|v|$. Then,

1=1 iSJ iSJ 

where is the index of the last aligned pair of letters a-,- in the cell Cy{i), 
and Pi''*'*'" is the number of nonzero values taken by i?-''^*^ = {R\^'^'')i<i<t{i) 
is the number of letters «i in the X-strand of Cy{i)) in the X-strand of 
the cell C^(i) (or nonzero values taken by S'-'*-*^ in the K-strand, depending 
on which cell strand has a lesser number of letters ai). From fl3.31l) . pi^^^~ is 
a geometric random variable with parameter Pj{i)- Now, let e > 0, let again 
P2 =P2/{pi +P2), and let 



Then, 




(3.32) 



Since for i & J, the geometric random variables are independent and 

each with parameters < p2, using f l3.20p . it follows that 

< exp ((1 + \og{e/e + 2p2)) e\v\) , (3.33) 
where the $Q_i$ are iid geometric random variables with parameter $\tilde p_2$.
Let 

F^{k):= fl Fi,, = |iV->^|i;|| andletFi:= f| F,{k). 
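From the very definition of $V(k)$ in (3.22), and using (3.28),

$$|V(k)| \le \left(\frac{27}{2}\right)^k,$$

which when combined with (3.33) leads to

$$\mathbb{P}(F_1(k)) \ge 1 - \exp\left(k\log(27/2) + k\left(1 + \log(\varepsilon/\theta + 2\tilde p_2)\right)\theta\right). \tag{3.34}$$

Of course, one wants

$$\log(27/2) + \left(1 + \log(\varepsilon/\theta + 2\tilde p_2)\right)\theta < 0. \tag{3.35}$$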
Choosing 9 = 1/25 and e = 10~^e~^^, then for any pi > I — 2~^e~^^, 

$$\mathbb{P}((F_1(k))^c) \le e^{-3k/100},$$

and so

$$\mathbb{P}(F_1^c) \le \sum_{2k \ge np_2^2} \mathbb{P}((F_1(k))^c) \le 34\exp(-3np_2^2/200).$$
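The two numerical steps above, the sign condition (3.35) and the constant 34 coming from the geometric sum, can be sanity-checked in a few lines of Python. The values of `eps` and `p2_tilde` below are assumptions of order $e^{-67}$ chosen purely for illustration, not values quoted from the text, and the function names are hypothetical.

```python
from math import exp, log


def lhs_335(theta, eps, p2_tilde):
    """Left-hand side of (3.35)."""
    return log(27 / 2) + (1 + log(eps / theta + 2 * p2_tilde)) * theta


def tail_constant(c):
    """Constant 1/(1 - exp(-c)) in sum_{k >= K} exp(-c k) = exp(-c K)/(1 - exp(-c))."""
    return 1 / (1 - exp(-c))


theta = 1 / 25
eps = 1e-2 * exp(-67)        # assumed illustrative value
p2_tilde = 0.25 * exp(-67)   # assumed illustrative value

# A margin of at least 3/100 in (3.35) yields the decay rate exp(-3k/100).
print(lhs_335(theta, eps, p2_tilde) <= -3 / 100)

# Summing exp(-3k/100) over k >= n * p_2^2 / 2 costs a factor of at most 34.
print(tail_constant(3 / 100) <= 34)
```

Under these assumed values the margin in (3.35) is roughly $-0.049$, and the tail constant is about $33.8$, consistent with the factor 34.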

Note also that for these choices of $\theta$ and $p_1$, (3.30) is satisfied and so $E$ also holds with high probability.

From the proof of Lemma 3.1, when $D_2((1-p_1))$ holds, the total number of non-$\alpha_1$ letters in $X$ and $Y$ is at most $4n(1-p_1)$. Thus $N_{>1} \le 4n(1-p_1)$, and so when $F_1 \cap D_2((1-p_1))$ holds, for every $v \in W$,

N~ ^ e\v\ ^ e npj ^ ep2 ^ _J_ ^ ^ 

Nyi ~ p24n(l — Pi) ~ p24:n{l — pi) 2 ~ 16(1— pi) ~ 16m 

Therefore,

$$\mathbb{P}(F^c) \le \mathbb{P}(F_1^c) + \mathbb{P}((D_2((1-p_1)))^c) \le 34\exp(-3np_2^2/200) + 4\exp(-2n(1-p_1)^2) \le 38\exp(-3np_2^2/200).$$
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Lemma 3.5 Let K2 = Ki/m. Then, for any p2 < 2 ^K2, 

$$\mathbb{P}(G) \ge 1 - 4\exp(-np_2^2/2).$$

Proof. For any $v \in W$, let $C_v(1), \dots, C_v(|v|)$ be the corresponding cells. If the cell $C_v(i)$ ends with a pair of aligned $\alpha_j$, $2 \le j \le m$, then let $\rho^{j,v_i}$ be the number of nonzero values taken by $R^j$ in $C_v(i)$. If $v_i \le 0$, by the same argument as in getting (3.31), $\rho^{j,v_i}$ has a geometric distribution with parameter $\tilde p_j$. If $v_i > 0$, there exists a geometric random variable $\tilde\rho^{j,v_i}$ with parameter $\tilde p_j$ such that $\tilde\rho^{j,v_i} \le \rho^{j,v_i} \le \tilde\rho^{j,v_i} + v_i$. Let $N^X_{>1}$ (resp. $N^Y_{>1}$) be the number of non-$\alpha_1$ letters in $X$ (resp. $Y$), so that $N_{>1} = N^X_{>1} + N^Y_{>1}$. Let

Gf, := l\v\ < ^iV>i| and := k| < ^N^i] , 



and so 
Since 

and for p2 < 2~'^e~^K2 



G^ n G^ c G^. 



\v\ 
i=l 



p((G':r)<p|M>^Epf^ 

i=l 



ri 



l<'t<|?)|,-Ui<0 l<i<\v\,Vi>0 



<P|M>^E6- 



2 



l^'l -51 I 

e 



< exp(-4|t;|), 
where the $Q_i$ are iid geometric random variables with parameter $\tilde p_2$. Likewise,

$$\mathbb{P}((G_v^Y)^c) \le \exp(-4|v|),$$

and thus

$$\mathbb{P}((G_v)^c) \le 2\exp(-4|v|).$$

As previously, let

$$G(k) := \bigcap_{v \in W \cap V(k)} G_v \quad \text{and} \quad G = \bigcap_{2k \ge np_2^2} G(k),$$

then

$$\mathbb{P}((G(k))^c) \le |V(k)|\, 2\exp(-4k) \le 2\exp(-k),$$

and

$$\mathbb{P}(G^c) \le \sum_{2k \ge np_2^2} \mathbb{P}((G(k))^c) \le 4\exp(-np_2^2/2).$$

■ 

Combining Lemmas 3.2, 3.3, 3.4 and 3.5, using (3.26), letting $K_2 = K_1/m$,
and ^ = 1 /25, it follows that for p2 < 2^'^e-^K2 = K/m, 

$$\mathbb{P}(A_n^c) \le \mathbb{P}(D^c) + \mathbb{P}(E^c) + \mathbb{P}(F^c) + \mathbb{P}(G^c)$$

< 5exp (-^^ + 74e-^°"'"P2 + 38 exp(-3np^/200) + Aexp{-npl/2) 

< 121exp (^-^^ , (3.36) 
and this finishes the proof of Theorem 2.1.